Introduction to Weight Quantization

Large language model optimization using 8-bit and 4-bit quantization

Author

Maxime Labonne

Published

June 20, 2023

Large Language Models (LLMs) are known for their extensive computational requirements. This is due to their large number of parameters, stored in matrices that are multiplied to produce an output.

Typically, the size of a model is calculated by multiplying the number of parameters (size) by the precision of these values (data type). However, to save memory, weights can be stored using different data types through a process known as quantization.
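As a quick illustration of this arithmetic (the parameter count below is approximate):

```python
# Model size = number of parameters x bytes per parameter.
# GPT-2 small has roughly 124M parameters.
def model_size_bytes(num_params, bytes_per_param):
    return num_params * bytes_per_param

num_params = 124_000_000
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {model_size_bytes(num_params, nbytes) / 1e6:.0f} MB")
```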

In this article, we will see how to reduce the precision of these parameters while maintaining good performance. We will summarize the latest spectacular improvements in this field and apply them to a toy example using a GPT-2 model.

The entire code is freely available on Google Colab and GitHub.

Background on Weight Quantization

To give a bit of background, we distinguish two main families of weight quantization techniques in the literature:

  • Post-Training Quantization (PTQ) is a straightforward technique where the weights of an already trained model are converted to lower precision without necessitating any retraining. Although easy to implement, PTQ is associated with potential performance degradation.
  • Quantization-Aware Training (QAT) incorporates the weight conversion process during the pre-training or fine-tuning stage, resulting in enhanced model performance. However, QAT is computationally expensive and demands representative training data.

In this article, we focus on PTQ to reduce the precision of our parameters. Common precisions, also known as “floating point data types,” include float32 (FP32), float16 (FP16), and bfloat16 (BF16):

  • FP32 stands for the standardized IEEE 32-bit floating point representation, allowing for a vast range of floating-point numbers. This format assigns 8 bits for the exponent, 23 bits for the mantissa, and 1 bit for the sign of the number. Most hardware supports FP32 operations and instructions.

  • FP16, on the other hand, reserves 5 bits for the exponent and 10 bits for the mantissa. This considerably reduces the representable range of FP16 numbers compared to FP32, and exposes FP16 numbers to the risk of overflowing and underflowing.

  • BF16 was created to mitigate the limitations of FP16. It reserves 8 bits for the exponent (as in FP32) and 7 bits for the fraction, maintaining the same dynamic range as FP32 but with less precision than FP16.

In ML jargon, FP32 is often termed “full precision” (4 bytes), while BF16 and FP16 are “half-precision” (2 bytes). Additionally, the int8 (INT8) data type consists of an 8-bit representation capable of storing 2^8 = 256 different values.
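These trade-offs are easy to inspect directly with PyTorch's `torch.finfo` and `torch.iinfo`:

```python
import torch

# Compare the numeric limits of each floating-point format
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{str(dtype):>14}: max={info.max:.3e}, machine epsilon={info.eps:.3e}")

# int8 is an integer type: 256 values from -128 to 127
iinfo = torch.iinfo(torch.int8)
print(f"    torch.int8: min={iinfo.min}, max={iinfo.max}")
```

Note how BF16's maximum matches FP32's (same 8-bit exponent) while its epsilon is larger than FP16's (fewer mantissa bits).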

Let’s see how to convert FP32 weights into an INT8 format.

Naive 8-bit Quantization

In quantization, the original data is “rounded” from one data type to another, leading to a lossy compression and potential information loss. Two popular 8-bit quantization techniques are zero-point quantization and absolute maximum (absmax) quantization. These techniques map floating-point values into the more compact int8 (1 byte) values.

With zero-point quantization, the input range is mapped onto the int8 range using both a scale and an offset (the zero-point). For example, values in the range -1.0 to 1.0 are scaled by a factor of 127 to fit into -127 to 127 and rounded to the nearest 8-bit value; for asymmetric ranges, the zero-point shifts the scaled values so the full int8 range is used. To retrieve an approximation of the original value, the zero-point is subtracted and the result is divided by the scale.

Absmax quantization operates symmetrically. To map a floating-point number to int8, the original value is divided by the absolute maximum value of the tensor and multiplied by 127 (the largest magnitude representable in int8), so the tensor's extreme values map to -127 and 127. To retrieve the original values, the int8 number is divided by the quantization factor, accepting some loss of precision due to rounding.
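Since absmax is implemented later in this article, here is a minimal sketch of zero-point quantization only. This is one common asymmetric formulation (per-tensor scale over the full min-max spread, int8 target); real schemes vary in details such as per-channel scales:

```python
import torch

def zeropoint_quantize(tensor):
    # Scale maps the full [min, max] spread onto the 256 int8 values
    x_min, x_max = tensor.min(), tensor.max()
    scale = 255 / (x_max - x_min)

    # Zero-point shifts the scaled values so that x_min lands on -128
    zero_point = (-scale * x_min - 128).round()

    # Quantize: scale, shift, round, clamp to the int8 range
    q = torch.clamp((tensor * scale + zero_point).round(), -128, 127)

    # Dequantize by undoing the shift and scale
    deq = (q - zero_point) / scale
    return q.to(torch.int8), deq

q, deq = zeropoint_quantize(torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0]))
print(q)    # tensor([-128,  -64,    0,   64,  127], dtype=torch.int8)
```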

These quantization techniques can reduce the size of models dramatically while preserving most of their performance, making them valuable tools in efficient deployment of machine learning models.

Let’s implement it. We start by loading the model and tokenizer for GPT-2 with Hugging Face’s transformers library. We want to observe the model’s size before and after quantization to evaluate the potential memory savings.

!pip install -q "bitsandbytes>=0.39.0"
!pip install -q git+https://github.com/huggingface/accelerate.git
!pip install -q git+https://github.com/huggingface/transformers.git

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(42)

# Set device to CPU for now
device = 'cpu'

# Load model and tokenizer
model_id = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f"Model size: {model.get_memory_footprint():,} bytes")
Model size: 510,342,192 bytes

We want to quantize these weights. We create a function that computes the absolute maximum of the tensor, which is used as a scaling factor to normalize the tensor values. The normalized tensor values are then rounded to nearest integers and stored in int8 format. The function also returns a dequantized version of the tensor for comparison, where the quantized tensor is scaled back by the original absolute maximum.

In the following example, we apply it to the first attention layer of the GPT-2 model.

def absmax_quantize(tensor, num_bits=8):
    # Calculate absolute max
    abs_max = torch.max(torch.abs(tensor))

    # Calculate scale
    scale = (2**(num_bits-1) - 1) / abs_max

    # Quantize
    tensor_quantized = (tensor * scale).round()

    # Dequantize
    tensor_dequantized = tensor_quantized / scale

    return tensor_quantized.to(torch.int8), tensor_dequantized

# Extract weights of the first layer
weights = model.transformer.h[0].attn.c_attn.weight.data
print(weights)
print()

# Quantize layer
weights_quantized, weights_dequantized = absmax_quantize(weights)
print(weights_quantized)
tensor([[-0.4738, -0.2614, -0.0978,  ...,  0.0513, -0.0584,  0.0250],
        [ 0.0874,  0.1473,  0.2387,  ..., -0.0525, -0.0113, -0.0156],
        [ 0.0039,  0.0695,  0.3668,  ...,  0.1143,  0.0363, -0.0318],
        ...,
        [-0.2592, -0.0164,  0.1991,  ...,  0.0095, -0.0516,  0.0319],
        [ 0.1517,  0.2170,  0.1043,  ...,  0.0293, -0.0429, -0.0475],
        [-0.4100, -0.1924, -0.2400,  ..., -0.0046,  0.0070,  0.0198]])

tensor([[-21, -12,  -4,  ...,   2,  -3,   1],
        [  4,   7,  11,  ...,  -2,  -1,  -1],
        [  0,   3,  16,  ...,   5,   2,  -1],
        ...,
        [-12,  -1,   9,  ...,   0,  -2,   1],
        [  7,  10,   5,  ...,   1,  -2,  -2],
        [-18,  -9, -11,  ...,   0,   0,   1]], dtype=torch.int8)

We clearly see the difference between the original weights (floats) and the quantized ones (integers between -128 and 127).

After that, we define a function generate_text() that we will use to compare the text generated by the original and quantized models. To quantize the model, we apply the absmax_quantize() function to all of its weights. Note that we replace the original weights with the dequantized ones, because gradients require floating-point values. In a real scenario, we would store the weights as INT8 and dequantize them (to FP16, for example) to run the model.

import numpy as np

# Define the text generation function
def generate_text(model, input_text, max_length=50):
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long).to(device)
    pad_token_id = tokenizer.eos_token_id
    output = model.generate(inputs=input_ids, max_length=max_length, do_sample=True, temperature=0.7,
                            attention_mask=attention_mask, pad_token_id=pad_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate text with original model
original_text = generate_text(model, "I have a dream")

# Quantize all model weights
weights = []
weights_quant = []
for p in model.parameters():
    weights.append(p.data)
    _, dequantized = absmax_quantize(p.data)
    p.data = dequantized
    weights_quant.append(dequantized)

# Generate text with quantized model
quantized_text = generate_text(model, "I have a dream")

Before we print the generated text, I want to check the impact of the quantization on the weights. Intuitively, we want to make sure that the quantized weights are close to the original ones. A way to check it is to plot the distribution of the dequantized and original weights. A lossy quantization would drastically change the weight distribution.

The following figure shows this comparison, where the blue histogram represents the original (FP32) weights, and the red one represents the dequantized (from INT8) weights. Note that we only display this plot between -2 and 2.

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Flatten weight tensors
weights = np.concatenate([t.cpu().numpy().flatten() for t in weights])
weights_quant = np.concatenate([t.cpu().numpy().flatten() for t in weights_quant])

# Set background style
plt.style.use('ggplot')

# Create figure and axis
fig, ax = plt.subplots(figsize=(10,5), dpi=300)

# Plot the histograms
ax.hist(weights, bins=150, alpha=0.5, label='Original FP32 weights', color='blue', range=(-2, 2))
ax.hist(weights_quant, bins=150, alpha=0.5, label='Dequantized INT8 weights', color='red', range=(-2, 2))

# Add grid
ax.grid(True, linestyle='--', alpha=0.6)

# Add legend
ax.legend()

# Add title and labels
ax.set_title('Comparison of Original and Dequantized Weights', fontsize=16)
ax.set_xlabel('Weights', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
plt.gca().yaxis.set_major_formatter(ticker.EngFormatter()) # Make y-ticks more human readable

# Improve font
plt.rc('font', size=12)

plt.tight_layout()
plt.show()

We observe a spike around 0 in the dequantized distribution: many small original values collapse onto the same quantized level. This spike shows that our quantization is lossy, since reversing the process doesn’t recover the original values.
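To put a rough number on this loss, we can check the round-trip error on a standalone random tensor (a sketch reusing the same absmax scheme; the matrix here is a stand-in, not a GPT-2 weight). With absmax, the worst possible per-weight error is half of one quantization step, i.e. abs_max / 254:

```python
import torch

torch.manual_seed(0)
w = torch.randn(512, 512)          # stand-in for a weight matrix

# Absmax round trip: quantize to int8 steps, then dequantize
abs_max = w.abs().max()
scale = 127 / abs_max
w_deq = (w * scale).round() / scale

mae = (w - w_deq).abs().mean()
worst = abs_max / 254              # half of one quantization step
print(f"Mean absolute error:  {mae:.6f}")
print(f"Worst possible error: {worst:.6f}")
```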

Let’s verify that by printing the output of each model.

print(f"Original model:  {original_text}")
print(f"Quantized model: {quantized_text}")
Original model:  I have a dream.

On that day, I made a joke, like, 'Fuck you, I'm going to do this for you.'

I was getting up at 7:45 in the morning, and I was like,
Quantized model: I have a dream of getting to know your enemy, but it is impossible. I have a dream of getting to know your enemies. But it is impossible. Why do you think I am capable of doing this? Because feathers are my best friend.

None of the outputs are particularly good, but it feels like the second one is very random. We can try to quantify this intuition by calculating the perplexity of each output. Perplexity is a common metric used to evaluate language models. It measures the uncertainty of a model in predicting the next token in a sequence.

We implement it with a quick function, since we don’t need to handle details like the context window length here (our sequences are short).

def calculate_perplexity(model, text):
    # Encode the text
    encodings = tokenizer(text, return_tensors='pt').to(device)

    # Define input_ids and target_ids
    input_ids = encodings.input_ids
    target_ids = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

    # Loss calculation
    neg_log_likelihood = outputs.loss

    # Perplexity calculation
    ppl = torch.exp(neg_log_likelihood)

    return ppl

ppl = calculate_perplexity(model, original_text)
print(f"Perplexity (original):  {ppl.item()}")

ppl = calculate_perplexity(model, quantized_text)
print(f"Perplexity (quantized): {ppl.item()}")
Perplexity (original):  10.820199012756348
Perplexity (quantized): 10.92601490020752

As expected, the perplexity of the original output is lower than that of the quantized one, although the difference is small for a single sample. Repeating this process over multiple generations gives similar results, suggesting that the quality of the generated text dropped because of the quantization process.

In the next section, we will see how we can do better using a more efficient INT8 quantization method.

8-bit Quantization with LLM.int8()

Introduced by Dettmers et al. (2022), LLM.int8() tackles the outlier features that appear in large models: the matrix multiplication is split so that the few columns containing outliers are processed in FP16, while everything else is quantized and multiplied in INT8. It is implemented in the bitsandbytes library: we simply pass load_in_8bit=True when loading the model (a GPU is required).

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', load_in_8bit=True)
print(f"Model size: {model.get_memory_footprint():,} bytes")

# Generate text with quantized model
text_llm_int8 = generate_text(model, "I have a dream")
print(text_llm_int8)

Model size: 261,462,552 bytes
I have a dream. I don't know what it is, but I have a dream. I have a dream. I have a dream. I have a dream. I have a dream. I have a dream. I have a dream. I
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.
ppl = calculate_perplexity(model, original_text)
print(f"Perplexity (original): {ppl.item()}")

ppl = calculate_perplexity(model, text_llm_int8)
print(f"Perplexity (LLM.int8()): {ppl.item()}")
Perplexity (original): 3.146484375
Perplexity (LLM.int8()): 2.3359375
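Interestingly, the LLM.int8() perplexity here is even lower than the original model’s. To build intuition for why outlier handling matters, here is a toy simulation of the mixed-precision decomposition idea (illustration only, not the real bitsandbytes kernels; the shapes and injected outlier are arbitrary):

```python
import torch

# Toy simulation of LLM.int8()'s mixed-precision decomposition.
torch.manual_seed(0)
X = torch.randn(4, 8)
X[:, 2] = 20.0              # inject an outlier feature dimension
W = torch.randn(8, 8)

threshold = 6.0             # default outlier threshold in LLM.int8()
outlier_cols = (X.abs() > threshold).any(dim=0)

def absmax_q(t):
    # Quantize to int8 steps and immediately dequantize (simulated loss)
    scale = 127 / t.abs().max()
    return (t * scale).round() / scale

# Outlier features stay in full precision; the rest go through int8
out = X[:, outlier_cols] @ W[outlier_cols] \
    + absmax_q(X[:, ~outlier_cols]) @ absmax_q(W[~outlier_cols])

naive = absmax_q(X) @ absmax_q(W)   # quantize everything, outliers included
exact = X @ W
print(f"Error with decomposition: {(out - exact).abs().max():.4f}")
print(f"Error quantizing naively: {(naive - exact).abs().max():.4f}")
```

The outlier column inflates the absmax scale of the naive version, wasting most of the int8 levels on ordinary values; the decomposition avoids that.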

4-bit Quantization with NF4

NF4 (NormalFloat) is a 4-bit data type introduced with QLoRA, designed for normally distributed weights. It is also available through bitsandbytes and the transformers integration.

!pip install -q auto-gptq

# Note: load_in_4bit=True alone defaults to the FP4 data type;
# we pass a BitsAndBytesConfig to explicitly select NF4
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4')
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', quantization_config=nf4_config)
print(f"Model size: {model.get_memory_footprint():,} bytes")

# Generate text with quantized model
text_nf4 = generate_text(model, "I have a dream")
print(text_nf4)
Model size: 261,462,552 bytes
I have a dream. I have a dream of having a family, of having a job. And I do it all with love. I love my love. And I don't have a dream anymore. Don't be afraid to be a good person
ppl = calculate_perplexity(model, original_text)
print(f"Perplexity (original): {ppl.item()}")

ppl = calculate_perplexity(model, text_nf4)
print(f"Perplexity (NF4): {ppl.item()}")
Perplexity (original): 3.146484375
Perplexity (NF4): 8.2109375

4-bit Quantization with GPTQ

GPTQ is another popular 4-bit technique: it quantizes the weights layer by layer, using a small set of calibration samples to minimize the error introduced on each layer’s outputs. We can apply it to GPT-2 with the auto-gptq library.

!pip install -q auto-gptq
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "gpt2"
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)
examples = [
    "What's the weather like today in New York City? I'm planning to visit Central Park in the afternoon.",
    "Hey, check out this article on the benefits of a plant-based diet: https://www.example.com/plant-based-diet",
    "Can you recommend any good science fiction books? I love stories about time travel and space exploration.",
    "How can I learn a new language quickly? I'm planning to move to Spain next year and need to learn Spanish.",
    "Just finished watching The Matrix. What are some other popular sci-fi movies to watch this weekend?",
    "I'm feeling stressed lately. What are some effective ways to deal with stress and improve my mental health?",
    "What are the top tourist attractions in Paris? I'll be visiting the city for a week and want to make the most of my time there.",
    "Tell me a joke about computers. I need something to cheer me up after a long day at work.",
    "How do I cook spaghetti carbonara? Can you share a simple recipe that I can follow at home?",
    "Can you give me a brief summary of the latest news? I haven't had the chance to catch up on current events.",
    "What's the difference between a psychologist and a psychiatrist? I'm considering therapy but not sure which one to see.",
    "I'm planning to start my own online store. What are the steps to start a small business and make it successful?",
]

examples = [tokenizer(text, truncation=True) for text in examples]

model.quantize(
    examples,
    use_triton=True,
    autotune_warmup_after_quantized=True,
    batch_size=1,
)
WARNING:auto_gptq.modeling._utils:using autotune_warmup will move model to GPU, make sure you have enough VRAM to load the whole model.
100%|██████████| 11/11 [03:16<00:00, 17.84s/it]
text_gptq = generate_text(model, "I have a dream")
print(text_gptq)
I have a dream job. I'm looking forward to being at the top of the standings. No excuses. I have to stay where I am and try to win as many games as I can. I'm just trying to be more than the team
ppl = calculate_perplexity(model, original_text)
print(f"Perplexity (original): {ppl.item()}")

ppl = calculate_perplexity(model, text_gptq)
print(f"Perplexity (GPTQ): {ppl.item()}")
Perplexity (original): 3.517578125
Perplexity (GPTQ): 9.4296875
